AITopics | unit discovery

End-to-end (E2E) keyword search (KWS) has emerged as an alternative and complimentary approach to conventional keyword search which depends on the output of automatic speech recognition (ASR) systems. While E2E methods greatly simplify the KWS pipeline, they generally have worse performance than their ASR-based counterparts, which can benefit from pretraining with untranscribed data. In this work, we propose a method for pretraining E2E KWS systems with untranscribed data, which involves using acoustic unit discovery (AUD) to obtain discrete units for untranscribed data and then learning to locate sequences of such units in the speech. We conduct experiments across languages and AUD systems: we show that finetuning such a model significantly outperforms a model trained from scratch, and the performance improvements are generally correlated with the quality of the AUD system used for pretraining.

acoustic unit, query, unit discovery, (13 more...)

arXiv.org Artificial Intelligence

2407.04652

Country:

South America > Colombia > Meta Department > Villavicencio (0.04)
Europe > Czechia > South Moravian Region > Brno (0.04)
Asia > Middle East > Republic of Türkiye (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Information Management > Search (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (1.00)
(2 more...)

Add feedback

Variable-rate hierarchical CPC leads to acoustic unit discovery in speech

Cuervo, Santiago, Łańcucki, Adrian, Marxer, Ricard, Rychlikowski, Paweł, Chorowski, Jan

arXiv.org Artificial IntelligenceDec-4-2022

The success of deep learning comes from its ability to capture the hierarchical structure of data by learning high-level representations defined in terms of low-level ones. In this paper we explore self-supervised learning of hierarchical representations of speech by applying multiple levels of Contrastive Predictive Coding (CPC). We observe that simply stacking two CPC models does not yield significant improvements over single-level architectures. Inspired by the fact that speech is often described as a sequence of discrete units unevenly distributed in time, we propose a model in which the output of a low-level CPC module is non-uniformly downsampled to directly minimize the loss of a high-level CPC module. The latter is designed to also enforce a prior of separability and discreteness in its representations by enforcing dissimilarity of successive high-level representations through focused negative sampling, and by quantization of the prediction targets. Accounting for the structure of the speech signal improves upon single-level CPC features and enhances the disentanglement of the learned representations, as measured by downstream speech recognition tasks, while resulting in a meaningful segmentation of the signal that closely resembles phone boundaries.

artificial intelligence, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2206.02211

Country:

South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
Europe > Poland > Lower Silesia Province > Wroclaw (0.04)
Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Self-supervised language learning from raw audio: Lessons from the Zero Resource Speech Challenge

Dunbar, Ewan, Hamilakis, Nicolas, Dupoux, Emmanuel

arXiv.org Artificial IntelligenceOct-27-2022

Recent progress in self-supervised or unsupervised machine learning has opened the possibility of building a full speech processing system from raw audio without using any textual representations or expert labels such as phonemes, dictionaries or parse trees. The contribution of the Zero Resource Speech Challenge series since 2015 has been to break down this long-term objective into four well-defined tasks -- Acoustic Unit Discovery, Spoken Term Discovery, Discrete Resynthesis, and Spoken Language Modeling -- and introduce associated metrics and benchmarks enabling model comparison and cumulative progress. We present an overview of the six editions of this challenge series since 2015, discuss the lessons learned, and outline the areas which need more work or give puzzling results.

artificial intelligence, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/JSTSP.2022.3206084

2210.15759

Country:

North America > Canada > Ontario > Toronto (0.14)
Europe > Poland > Lower Silesia Province > Wroclaw (0.04)
Europe > France > Île-de-France > Paris > Paris (0.04)
(11 more...)

Genre: Overview (1.00)

Industry: Education > Curriculum > Subject-Specific Education (0.40)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Cognitive Science (1.00)
(4 more...)

Add feedback

The Zero Resource Speech Challenge 2020: Discovering discrete subword and word units

Dunbar, Ewan, Karadayi, Julien, Bernard, Mathieu, Cao, Xuan-Nga, Algayres, Robin, Ondel, Lucas, Besacier, Laurent, Sakti, Sakriani, Dupoux, Emmanuel

arXiv.org Artificial IntelligenceOct-12-2020

We present the Zero Resource Speech Challenge 2020, which aims at learning speech representations from raw audio signals without any labels. It combines the data sets and metrics from two previous benchmarks (2017 and 2019) and features two tasks which tap into two levels of speech representation. The first task is to discover low bit-rate subword representations that optimize the quality of speech synthesis; the second one is to discover word-like units from unsegmented raw speech. We present the results of the twenty submitted models and discuss the implications of the main findings for unsupervised speech learning.

discovery, representation, synthesis, (13 more...)

arXiv.org Artificial Intelligence

2010.05967

Country: